WASSUP? LOL : Characterizing Out-of-Vocabulary Words in Twitter

Authors

Suman Kalyan Maity

Anshit E. Chaudhary

Shraman Kumar

Animesh Mukherjee

Chaitanya Sarda

Abhijeet Patil

Akash Mondal

Abstract

Copyright is held by the owner/author(s). CSCW'16 Companion, February 27 to March 02, 2016, San Francisco, CA, USA. ACM 978-1-4503-3950-6/16/02. http://dx.doi.org/10.1145/2818052.2869110

Language in social media is driven largely by new words and spellings that constantly enter the lexicon, polluting it and causing it to deviate sharply from the formal written language. The primary entities of such language are out-of-vocabulary (OOV) words. In this paper, we study various sociolinguistic properties of OOV words and propose a classification model that sorts them into at least six categories. The model achieves 81.26% accuracy with high precision and recall. Content features are the most discriminative, followed by lexical and context features.

Similar resources

Lexical Normalisation of Short Text Messages: Makn Sens a #twitter

Twitter provides access to large volumes of data in real time, but is notoriously noisy, hampering its utility for NLP. In this paper, we target out-of-vocabulary words in short text messages and propose a method for identifying and normalising ill-formed words. Our method uses a classifier to detect ill-formed words, and generates correction candidates based on morphophonemic similarity. Both ...

Crowd Sentiment Detection during Disasters and Crises

Microblogs are an opportunity for scavenging critical information such as sentiments. This information can be used to detect rapidly the sentiment of the crowd towards crises or disasters. It can be used as an effective tool to inform humanitarian efforts, and improve the ways in which informative messages are crafted for the crowd regarding an event. Unique characteristics of microblogs (lack ...

Mining Twitter for New Words

New lexical elements such as LOL are appearing in natural digital language at high frequencies. The usage of these elements suggests that they are being treated like real words. The first step in examining this type of element is to identify them. We gathered 2,798 messages within a 10-mile radius of a specific GPS location for a 10.5 hour period. The novel elements were identified by excluding...

Word Normalization in Twitter Using Finite-state Transducers

This paper presents a linguistic approach based on weighted-finite state transducers for the lexical normalisation of Spanish Twitter messages. The system developed consists of transducers that are applied to out-of-vocabulary tokens. Transducers implement linguistic models of variation that generate sets of candidates according to a lexicon. A statistical language model is used to obtain the m...

Review of Twitter sentiment analysis

Twitter data has recently been considered for a large variety of advanced analyses. Analysis of Twitter data imposes new challenges because the data distribution is intrinsically sparse, due to the large number of messages posted every day using a wide vocabulary. The Sentiment Analysis task is divided into two steps: Feature selection methods and Sentiment classification methods. Feature selec...

Publication date: 2016
